Building the Customer and Product Modelling

1 Load Data

We first want to load our datasets and prepare them for some simple association rules mining.

1.1 Load Transaction Data

tnx_data_tbl <- read_rds("data/retail_data_cleaned_tbl.rds")

tnx_data_tbl %>% glimpse()
## Rows: 1,067,371
## Columns: 23
## $ row_id            <chr> "ROW0000001", "ROW0000002", "ROW0000003", "ROW000000…
## $ excel_sheet       <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010"…
## $ invoice_id        <chr> "489434", "489434", "489434", "489434", "489434", "4…
## $ stock_code        <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ description       <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY …
## $ quantity          <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, …
## $ invoice_date      <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 200…
## $ price             <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55…
## $ customer_id       <chr> "13085", "13085", "13085", "13085", "13085", "13085"…
## $ country           <chr> "United Kingdom", "United Kingdom", "United Kingdom"…
## $ stock_code_upr    <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ cancellation      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ invoice_dttm      <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-0…
## $ invoice_month     <fct> December, December, December, December, December, De…
## $ invoice_dow       <fct> Tuesday, Tuesday, Tuesday, Tuesday, Tuesday, Tuesday…
## $ invoice_dom       <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ invoice_hour      <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07"…
## $ invoice_minute    <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45"…
## $ invoice_woy       <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49"…
## $ invoice_ym        <chr> "200912", "200912", "200912", "200912", "200912", "2…
## $ stock_value       <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04…
## $ exclude           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

For the rules mining we need only the invoice ID and the stock code, so we can ignore the remaining columns. We also set aside the issue of returns and look only at purchases.

tnx_purchase_tbl <- tnx_data_tbl %>%
  filter(
    quantity > 0,
    price > 0,
    exclude == FALSE
    ) %>%
  drop_na(customer_id) %>%
  select(
    invoice_id, invoice_date, stock_code, customer_id, quantity, price,
    stock_value, description
    )

tnx_purchase_tbl %>% glimpse()
## Rows: 802,651
## Columns: 8
## $ invoice_id   <chr> "489434", "489434", "489434", "489434", "489434", "489434…
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-…
## $ stock_code   <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "…
## $ customer_id  <chr> "13085", "13085", "13085", "13085", "13085", "13085", "13…
## $ quantity     <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, 3…
## $ price        <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.7…
## $ stock_value  <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50, …
## $ description  <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGHT…

To build the association rules, we need to load the transactions in the format required by the arules package. We set a cutoff date: transactions on or before this date are used to train the model, and the remainder is held out for validation.

training_data_date <- as.Date("2011-03-31")

We now combine all this data to construct our association rules.

tnx_purchase_tbl %>%
  filter(invoice_date <= training_data_date) %>%
  select(invoice_id, stock_code) %>%
  write_csv("data/tnx_arules_input.csv")

basket_tnxdata <- read.transactions(
    file   = "data/tnx_arules_input.csv",
    format = "single",
    sep    = ",",
    header = TRUE,
    cols   = c("invoice_id", "stock_code")
    )

basket_tnxdata %>% glimpse()
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   ..@ itemInfo   :'data.frame':  4091 obs. of  1 variable:
##   .. ..$ labels: chr [1:4091] "10002" "10080" "10109" "10120" ...
##   ..@ itemsetInfo:'data.frame':  22873 obs. of  1 variable:
##   .. ..$ transactionID: chr [1:22873] "489434" "489435" "489436" "489437" ...
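As a quick sanity check on the basket data, we can look at the items with the highest relative support before mining any rules (a minimal sketch; itemFrequency() is part of arules):

```r
# Top 10 items by relative support across the training baskets
basket_tnxdata %>%
  itemFrequency(type = "relative") %>%
  sort(decreasing = TRUE) %>%
  head(10)
```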

1.2 Load Customer Data

We also want to load the various customer data, such as cohort membership.

customer_cohort_tbl <- read_rds("data/customer_cohort_tbl.rds")

customer_cohort_tbl %>% glimpse()
## Rows: 5,852
## Columns: 5
## $ customer_id     <chr> "12346", "12347", "12348", "12349", "12350", "12351", …
## $ cohort_qtr      <chr> "2010 Q1", "2010 Q4", "2010 Q3", "2010 Q2", "2011 Q1",…
## $ cohort_ym       <chr> "2010 03", "2010 10", "2010 09", "2010 04", "2011 02",…
## $ first_tnx_date  <date> 2010-03-02, 2010-10-31, 2010-09-27, 2010-04-29, 2011-…
## $ total_tnx_count <int> 3, 8, 5, 3, 1, 1, 9, 2, 1, 2, 6, 2, 5, 10, 6, 4, 10, 2…

1.3 Load Product Data

We also want to load the free-text description of the various stock items, as this will help with the interpretation of the various rules we find.

product_data_tbl <- read_rds("data/stock_description_tbl.rds")

product_data_tbl %>% glimpse()
## Rows: 4,725
## Columns: 2
## $ stock_code <chr> "10002", "10002R", "10080", "10109", "10120", "10123C", "10…
## $ desc       <chr> "INFLATABLE POLITICAL GLOBE", "ROBOT PENCIL SHARPNER", "GRO…

2 Build Association Rules Model

We now build our association rules using a low support threshold.

The idea is to repeat some of the initial association rules analysis: we use the APRIORI algorithm to mine the rules, and then convert the discovered rules to produce a graph of the products and the rules.

With this graph, we then use its disjoint components to cluster the products, taking the largest component and partitioning it further using standard graph-clustering algorithms.

2.1 Construct Association Rules

Having loaded the individual transaction data we now construct our basket data and use the APRIORI algorithm to discover our rules.

basket_arules <- apriori(
    basket_tnxdata,
    parameter = list(supp = 0.005, conf = 0.01)
  )
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.01    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 114 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4091 item(s), 22873 transaction(s)] done [0.17s].
## sorting and recoding items ... [1146 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.06s].
## writing ... [6223 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
basket_arules_tbl <- basket_arules %>%
  as("data.frame") %>%
  as_tibble() %>%
  arrange(desc(lift))

basket_arules_tbl %>% glimpse()
## Rows: 6,223
## Columns: 6
## $ rules      <chr> "{22459} => {22458}", "{22458} => {22459}", "{22520,22521,2…
## $ support    <dbl> 0.005639837, 0.005639837, 0.005945875, 0.005071482, 0.00529…
## $ confidence <dbl> 0.8835616, 0.8600000, 0.9510490, 0.9508197, 0.9453125, 0.94…
## $ coverage   <dbl> 0.006383072, 0.006557950, 0.006251913, 0.005333800, 0.00559…
## $ lift       <dbl> 134.73137, 134.73137, 105.59875, 105.57329, 104.96181, 104.…
## $ count      <int> 129, 129, 136, 116, 121, 137, 131, 131, 117, 136, 121, 142,…
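For a more readable view of the strongest rules, we can also inspect the top few by lift directly on the rules object (a minimal sketch using arules itself, following the same head()-with-by pattern used later for plotting):

```r
# Print the 10 rules with the highest lift in a readable format
basket_arules %>%
  head(n = 10, by = "lift") %>%
  inspect()
```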

Having constructed the main association rules, we then convert the discovered rules into a graph.

apriori_rules_igraph <- basket_arules %>%
  plot(
    measure = "support",
    method  = "graph",
    engine  = "igraph",
    control = list(max = 20000)
    )

apriori_rules_igraph %>% summary()
## IGRAPH b609a82 DN-B 6882 14757 -- 
## + attr: name (v/c), label (v/c), type (v/n), support (v/n), confidence
## | (v/n), coverage (v/n), lift (v/n), count (v/n), order (v/n)

Having constructed the graph, we now want to visualise it.

basket_arules %>%
  head(n = 500, by = "support") %>%
  plot(
    measure  = "lift",
    method   = "graph",
    engine   = "htmlwidget"
    )

2.2 Determine Graph Clusters

With the constructed graph we now want to label the elements that are part of the disjoint components of the graph.

apriori_rules_tblgraph <- apriori_rules_igraph %>%
  igraph::as.undirected(mode = "collapse") %>%
  as_tbl_graph() %>%
  mutate(
    component_id = group_components()
    ) %>%
  group_by(component_id) %>%
  mutate(
    component_size = n()
    ) %>%
  ungroup()

apriori_rules_tblgraph %>% print()
## # A tbl_graph: 6882 nodes and 14757 edges
## #
## # A bipartite simple graph with 217 components
## #
## # Node Data: 6,882 × 11 (active)
##   name  label  type support confidence coverage  lift count order component_id
##   <chr> <chr> <dbl>   <dbl>      <dbl>    <dbl> <dbl> <int> <int>        <int>
## 1 1     10002     1      NA         NA       NA    NA    NA    NA           53
## 2 17    15036     1      NA         NA       NA    NA    NA    NA           54
## 3 23    1505…     1      NA         NA       NA    NA    NA    NA            1
## 4 24    1505…     1      NA         NA       NA    NA    NA    NA            1
## 5 25    1505…     1      NA         NA       NA    NA    NA    NA            1
## 6 56    1615…     1      NA         NA       NA    NA    NA    NA           55
## # … with 6,876 more rows, and 1 more variable: component_size <int>
## #
## # Edge Data: 14,757 × 2
##    from    to
##   <int> <int>
## 1   535   660
## 2   577   661
## 3    71   662
## # … with 14,754 more rows

From the graph, we extract the nodes that correspond to products (as opposed to the nodes corresponding to the mined association rules). Product nodes are identified by their rule metrics (support, confidence and so on) being missing, as these values are attached only to rule nodes.

We also wish to add an additional column that is the size of the group, so it is easier to identify outsized subgraphs suitable for further partitioning.

product_cluster_disjoint_tbl <- apriori_rules_tblgraph %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(are_na(support)) %>%
  group_by(component_id) %>%
  mutate(
    cluster_size = n()
    ) %>%
  ungroup() %>%
  arrange(desc(cluster_size), label) %>%
  group_by(component_id) %>%
  mutate(
    product_group_id = sprintf("AR_DISJOINT_%03d", cur_group_id()),
    cluster_size,
    stock_code       = label
    ) %>%
  ungroup() %>%
  select(product_group_id, cluster_size, stock_code) %>%
  arrange(product_group_id, stock_code)

product_cluster_disjoint_tbl %>% glimpse()
## Rows: 659
## Columns: 3
## $ product_group_id <chr> "AR_DISJOINT_001", "AR_DISJOINT_001", "AR_DISJOINT_00…
## $ cluster_size     <int> 365, 365, 365, 365, 365, 365, 365, 365, 365, 365, 365…
## $ stock_code       <chr> "15056BL", "15056N", "15056P", "20674", "20675", "206…
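To sanity-check these groupings, we can attach the free-text descriptions to one of the smaller groups and eyeball whether the products belong together (a minimal sketch; assumes product_data_tbl from Section 1.3 is loaded):

```r
# Look at the product descriptions within a single disjoint group
product_cluster_disjoint_tbl %>%
  filter(product_group_id == "AR_DISJOINT_002") %>%
  inner_join(product_data_tbl, by = "stock_code")
```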

We now segment up the largest disjoint subgraph using alternative clustering techniques.

We try a few different types, inspecting the output of the various algorithms to see which clustering may be the most suitable.

run_subgraph_clusters <- function(graph_cluster_func, rules_tblgraph, ...) {
  subgraph_clusters_tbl <- rules_tblgraph %>%
    to_subgraph(component_size == max(component_size)) %>%
    use_series(subgraph) %>%
    morph(to_undirected) %>%
    mutate(
      sub_id = graph_cluster_func(...)
      ) %>%
    unmorph() %>%
    activate(nodes) %>%
    as_tibble() %>%
    filter(are_na(support)) %>%
    count(sub_id, name = "cluster_size", sort = TRUE) %>%
    mutate(
      sub_id = factor(1:n(), levels = 1:n())
    )
  
  return(subgraph_clusters_tbl)
}

cluster_func <- c(
    "group_fast_greedy",
    "group_infomap",
    "group_label_prop",
    "group_louvain",
    "group_spinglass"
    )

cluster_data_tbl <- tibble(cluster_func_name = cluster_func) %>%
  mutate(
    cluster_func = map(cluster_func_name, get),
    clustered    = map(cluster_func, run_subgraph_clusters,
                       rules_tblgraph = apriori_rules_tblgraph)
    ) %>%
  select(cluster_func_name, clustered) %>%
  unnest(clustered)

cluster_data_tbl %>% glimpse()
## Rows: 406
## Columns: 3
## $ cluster_func_name <chr> "group_fast_greedy", "group_fast_greedy", "group_fas…
## $ sub_id            <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ cluster_size      <int> 84, 41, 29, 27, 23, 21, 20, 18, 16, 15, 13, 12, 9, 9…

Having partitioned this largest component with each algorithm, we now visualise the count and size of each resulting cluster, and use this to determine which clustering splits the data into a smaller number of larger clusters.

ggplot(cluster_data_tbl) +
  geom_col(aes(x = sub_id, y = cluster_size)) +
  geom_hline(aes(yintercept = 5), colour = "red") +
  facet_wrap(vars(cluster_func_name), scales = "free") +
  labs(
    x = "ID",
    y = "Cluster Size"
    ) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 8))

plot_tbl <- cluster_data_tbl %>%
  group_by(cluster_func_name) %>%
  count(cluster_size, name = "cluster_count", sort = TRUE) %>%
  ungroup() %>%
  mutate(cluster_size = as.factor(cluster_size))

ggplot(plot_tbl) +
  geom_col(aes(x = cluster_size, y = cluster_count, group = cluster_size)) +
  facet_wrap(vars(cluster_func_name), scales = "free") +
  labs(
    x = "Cluster Size",
    y = "Community Count",
    title = "Visualisation of Spread of Cluster Sizes"
    ) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

From this, it appears that group_louvain is the method of choice.

Thus, we re-run the clustering for this larger component using the chosen algorithm and use this to create our various product groups.

subgraph_groups_tbl <- apriori_rules_tblgraph %>%
  to_subgraph(component_size == max(component_size)) %>%
  use_series(subgraph) %>%
  morph(to_undirected) %>%
  mutate(
    sub_id = group_louvain()
    ) %>%
  unmorph() %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(are_na(support)) %>%
  group_by(sub_id) %>%
  mutate(
    cluster_size = n()
    ) %>%
  ungroup() %>%
  arrange(desc(cluster_size), label) %>%
  group_by(sub_id) %>%
  mutate(
    product_group_id = sprintf("AR_LARGE_%03d", cur_group_id()),
    cluster_size,
    stock_code       = label
    ) %>%
  ungroup() %>%
  select(product_group_id, cluster_size, stock_code) %>%
  arrange(product_group_id, stock_code)
  

subgraph_groups_tbl %>% glimpse()
## Rows: 365
## Columns: 3
## $ product_group_id <chr> "AR_LARGE_001", "AR_LARGE_001", "AR_LARGE_001", "AR_L…
## $ cluster_size     <int> 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 1…
## $ stock_code       <chr> "20711", "20712", "20713", "21928", "21929", "21930",…

We now combine both these lists of groupings, removing the largest disjoint group as its products have been reassigned to the AR_LARGE groups.

product_cluster_tbl <- list(
    product_cluster_disjoint_tbl,
    subgraph_groups_tbl
    ) %>%
  bind_rows() %>%
  filter(product_group_id != "AR_DISJOINT_001")

product_cluster_tbl %>% glimpse()
## Rows: 659
## Columns: 3
## $ product_group_id <chr> "AR_DISJOINT_002", "AR_DISJOINT_002", "AR_DISJOINT_00…
## $ cluster_size     <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ stock_code       <chr> "21668", "21669", "21670", "21671", "21672", "21673",…

2.3 Assign Products to Groups

We now want to look at our complete list of products and assign each of them to one of our product groups. In terms of coverage, we need to check whether the products appearing in the most invoices have all been assigned to a group.

We also want to look at the most commonly purchased items (in terms of appearance in baskets as opposed to quantity sold).

product_popular_tbl <- tnx_purchase_tbl %>%
  mutate(
    stock_code = str_to_upper(stock_code)
    ) %>%
  count(stock_code, name = "invoice_count", sort = TRUE)

product_popular_tbl %>% glimpse()
## Rows: 4,621
## Columns: 2
## $ stock_code    <chr> "85123A", "22423", "85099B", "84879", "20725", "21212", …
## $ invoice_count <int> 5188, 3428, 3360, 2777, 2678, 2654, 2141, 2121, 2117, 21…

We now combine this data to construct a product dataset containing the relevant summary data about each product.

product_data_full_tbl <- product_data_tbl %>%
  left_join(product_cluster_tbl, by = "stock_code") %>%
  left_join(product_popular_tbl, by = "stock_code") %>%
  replace_na(
    list(product_group_id = "none", cluster_size = "0")
    ) %>%
  arrange(desc(invoice_count)) %>%
  mutate(ranking = 1:n()) %>%
  semi_join(tnx_purchase_tbl, by = "stock_code") %>%
  arrange(stock_code)

product_data_full_tbl %>% glimpse()
## Rows: 4,621
## Columns: 6
## $ stock_code       <chr> "10002", "10080", "10109", "10120", "10123C", "10123G…
## $ desc             <chr> "INFLATABLE POLITICAL GLOBE", "GROOVY CACTUS INFLATAB…
## $ product_group_id <chr> "AR_DISJOINT_053", "none", "none", "none", "none", "n…
## $ cluster_size     <chr> "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"…
## $ invoice_count    <int> 317, 27, 1, 66, 50, 13, 19, 10, 133, 212, 52, 258, 16…
## $ ranking          <int> 718, 3160, 4501, 2336, 2601, 3699, 3448, 3861, 1589, …

First, let us display the table to help us inspect the data.

product_data_full_tbl %>% datatable()

To make it more obvious, we look at the products unassigned to a group and see how they rank in terms of invoice count.

product_data_full_tbl %>% filter(product_group_id == "none") %>% datatable()
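As a rough coverage check, we can compute what share of invoice line-items the grouped versus ungrouped products account for (a minimal sketch using the columns already in product_data_full_tbl):

```r
# Share of invoice line-items covered by products assigned to a group
product_data_full_tbl %>%
  mutate(assigned = product_group_id != "none") %>%
  group_by(assigned) %>%
  summarise(
    .groups = "drop",

    total_lines = sum(invoice_count, na.rm = TRUE)
    ) %>%
  mutate(line_prop = total_lines / sum(total_lines))
```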

3 Construct Transaction-Based Graph Clusters

We can treat the transaction data as a graph, turning both invoices and stock items into nodes, and creating an edge between a stock item and an invoice whenever the item occurs on that invoice.

We then cluster the graph to create groupings for the different stock codes.

stock_nodes_tbl <- tnx_purchase_tbl %>%
  select(stock_code) %>%
  distinct() %>%
  transmute(node_label = stock_code, node_type = "stock")

invoice_nodes_tbl <- tnx_purchase_tbl %>%
  select(invoice_id) %>%
  distinct() %>%
  transmute(node_label = invoice_id, node_type = "invoice")

nodes_tbl <- list(stock_nodes_tbl, invoice_nodes_tbl) %>%
  bind_rows()

edges_tbl <- tnx_purchase_tbl %>%
  select(stock_code, invoice_id, quantity, price)


basket_tblgraph <- tbl_graph(
  nodes    = nodes_tbl,
  edges    = edges_tbl,
  directed = FALSE,
  node_key = "node_label"
)

3.1 Check Graph Clustering Approaches

First we perform our basic clustering by splitting off the different disjoint components of the graph.

basket_tblgraph <- basket_tblgraph %>%
  mutate(
    component_id = group_components()
    ) %>%
  group_by(component_id) %>%
  mutate(
    component_size = n()
    ) %>%
  ungroup()

basket_tblgraph %>% print()
## # A tbl_graph: 41215 nodes and 802651 edges
## #
## # An undirected multigraph with 6 components
## #
## # Node Data: 41,215 × 4 (active)
##   node_label node_type component_id component_size
##   <chr>      <chr>            <int>          <int>
## 1 85048      stock                1          41202
## 2 79323P     stock                1          41202
## 3 79323W     stock                1          41202
## 4 22041      stock                1          41202
## 5 21232      stock                1          41202
## 6 22064      stock                1          41202
## # … with 41,209 more rows
## #
## # Edge Data: 802,651 × 4
##    from    to quantity price
##   <int> <int>    <dbl> <dbl>
## 1     1  4622       12  6.95
## 2     2  4622       12  6.75
## 3     3  4622       12  6.75
## # … with 802,648 more rows

We now want to check the sizes of the disjoint components of this graph.

basket_tblgraph %>%
  as_tibble() %>%
  filter(node_type == "stock") %>%
  count(component_id, name = "stock_count", sort = TRUE)
## # A tibble: 6 × 2
##   component_id stock_count
##          <int>       <int>
## 1            1        4616
## 2            2           1
## 3            3           1
## 4            4           1
## 5            5           1
## 6            6           1

We see that almost all the stock codes are contained in that one large component and so confine the rest of this analysis to that one large component.

run_subgraph_clusters <- function(graph_cluster_func, labelling, input_tblgraph, ...) {
  message(glue("Clustering the graph using {labelling}..."))
  
  subgraph_clusters_tbl <- input_tblgraph %>%
    mutate(
      cluster_id = graph_cluster_func(...)
      ) %>%
    activate(nodes) %>%
    as_tibble() %>%
    filter(node_type == "stock") %>%
    count(cluster_id, name = "cluster_size", sort = TRUE) %>%
    mutate(
      cluster_id = factor(1:n(), levels = 1:n())
    )
  
  return(subgraph_clusters_tbl)
}
cluster_func <- c(
    "group_fast_greedy",
    "group_infomap",
    "group_leading_eigen",
    "group_louvain"
    )

largecomp_tblgraph <- basket_tblgraph %>%
  to_subgraph(component_size == max(component_size)) %>%
  use_series(subgraph)

cluster_data_tbl <- tibble(cluster_func_name = cluster_func) %>%
  mutate(
    cluster_func = map(cluster_func_name, get),
    clustered    = map2(
      cluster_func, cluster_func_name,
      run_subgraph_clusters,
      input_tblgraph = largecomp_tblgraph
      )
    ) %>%
  select(cluster_func_name, clustered) %>%
  unnest(clustered)

cluster_data_tbl %>% glimpse()
## Rows: 217
## Columns: 3
## $ cluster_func_name <chr> "group_fast_greedy", "group_fast_greedy", "group_fas…
## $ cluster_id        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1…
## $ cluster_size      <int> 1659, 1365, 833, 753, 3, 1, 1, 1, 3778, 54, 49, 25, …

Having created a summary of the data splits, we now want to construct a visualisation of how the various cluster routines split the data.

To do this, we turn the size of each cluster into a ‘label’ and then count how many clusters of that size there are. We then use this summary data to construct barplots of the size.

plot_tbl <- cluster_data_tbl %>%
  group_by(cluster_func_name) %>%
  count(cluster_size, name = "cluster_count", sort = TRUE) %>%
  ungroup() %>%
  mutate(cluster_size = as.factor(cluster_size))


ggplot(plot_tbl) +
  geom_col(aes(x = cluster_size, y = cluster_count, group = cluster_size)) +
  facet_wrap(vars(cluster_func_name), scales = "free") +
  labs(
    x = "Cluster Size",
    y = "Community Count",
    title = "Visualisation of Spread of Cluster Sizes"
    ) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

From this graphic, we see that group_louvain gives us the most even split across the data - though the sizes are still hugely unequal.

3.2 Create Cluster-Based Allocation

We now use this algorithm to cluster this large component of the graph, giving us an alternative allocation of each stock_code to a product group.

largecomp_clustered_tbl <- largecomp_tblgraph %>%
  mutate(
    cluster_id = group_louvain()
    ) %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(node_type == "stock") %>%
  mutate(
    cluster_group = sprintf("TNX_%03d", cluster_id)
    ) %>%
  select(stock_code = node_label, cluster_group)

largecomp_clustered_tbl %>% glimpse()
## Rows: 4,616
## Columns: 2
## $ stock_code    <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", …
## $ cluster_group <chr> "TNX_005", "TNX_001", "TNX_001", "TNX_008", "TNX_003", "…
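Before combining these allocations, it is worth checking how many products land in each transaction-based cluster (a quick sketch over the table we just built):

```r
# Count of stock codes per transaction-based cluster
largecomp_clustered_tbl %>%
  count(cluster_group, name = "product_count", sort = TRUE)
```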

3.3 Combine Clustering Data

We now want to combine this data to construct our stock code allocations.

other_tbl <- basket_tblgraph %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(
    node_type == "stock",
    component_size != max(component_size)
    ) %>%
  transmute(
    stock_code = node_label, cluster_group = "TNX_010"
    )

product_group_tnxgroups_tbl <- list(
    largecomp_clustered_tbl,
    other_tbl
    ) %>%
  bind_rows() %>%
  arrange(stock_code) %>%
  inner_join(product_data_tbl, by = "stock_code") %>%
  select(stock_code, product_group = cluster_group, desc)

product_group_tnxgroups_tbl %>% glimpse()
## Rows: 4,621
## Columns: 3
## $ stock_code    <chr> "10002", "10080", "10109", "10120", "10123C", "10123G", …
## $ product_group <chr> "TNX_004", "TNX_002", "TNX_003", "TNX_004", "TNX_004", "…
## $ desc          <chr> "INFLATABLE POLITICAL GLOBE", "GROOVY CACTUS INFLATABLE"…

4 Construct RFM Customer Segments

We now wish to repeat our RFM analysis, and then assign each customer in our customer base to one of these segments.

segment_names <- c(
  "Champions", "Loyal Customers", "Potential Loyalist", "New Customers",
  "Promising", "Need Attention", "About To Sleep", "At Risk",
  "Can't Lose Them", "Lost"
  )

recency_lower   <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
recency_upper   <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
frequency_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
frequency_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
monetary_lower  <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
monetary_upper  <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)

segment_defs_tbl <- tibble(
  segment_names,
  recency_lower,
  recency_upper,
  frequency_lower,
  frequency_upper,
  monetary_lower,
  monetary_upper
  )

segment_defs_tbl %>% glimpse()
## Rows: 10
## Columns: 7
## $ segment_names   <chr> "Champions", "Loyal Customers", "Potential Loyalist", …
## $ recency_lower   <dbl> 4, 2, 3, 4, 3, 2, 2, 1, 1, 1
## $ recency_upper   <dbl> 5, 5, 5, 5, 4, 3, 3, 2, 1, 2
## $ frequency_lower <dbl> 4, 3, 1, 1, 1, 2, 1, 2, 4, 1
## $ frequency_upper <dbl> 5, 5, 3, 1, 1, 3, 2, 5, 5, 2
## $ monetary_lower  <dbl> 4, 3, 1, 1, 1, 2, 1, 2, 4, 1
## $ monetary_upper  <dbl> 5, 5, 3, 1, 1, 3, 2, 5, 5, 2

We first visually inspect these segment definitions and the bands.

segments_show_tbl <- segment_defs_tbl %>%
  mutate(
    recency   = glue("{recency_lower}-{recency_upper}")     %>% as.character(),
    frequency = glue("{frequency_lower}-{frequency_upper}") %>% as.character(),
    monetary  = glue("{monetary_lower}-{monetary_upper}")   %>% as.character()
    ) %>%
  select(
    segment_names, recency, frequency, monetary
    )

segments_show_tbl %>%
  datatable(
    colnames = c("Segment", "R", "F", "M"),
    options = list(
      columnDefs = list(list(className = 'dt-left', targets = 0:4))
      )
    )

We now construct the RFM data from the purchase data and assign each of the customers to a segment based on their RFM score.

There is a reasonable number of transactions with a missing customer_id, so we exclude this from the analysis.

customer_rfmdata <- tnx_purchase_tbl %>%
  filter(
    !are_na(customer_id),
    invoice_date <= training_data_date
    ) %>%
  group_by(invoice_date, customer_id) %>%
  summarise(
    .groups = "drop",
    
    total_spend = sum(stock_value)
    ) %>%
  rfm_table_order(
    customer_id   = customer_id,
    order_date    = invoice_date,
    revenue       = total_spend,
    analysis_date = training_data_date
    )

customer_rfmdata %>% print()
## # A tibble: 4,691 × 9
##    customer_id date_most_recent recency_days transaction_count amount
##    <chr>       <date>                  <dbl>             <dbl>  <dbl>
##  1 12346       2011-01-18                 72                 3 77353.
##  2 12347       2011-01-26                 64                 3  2510.
##  3 12348       2011-01-25                 65                 3  1061.
##  4 12349       2010-10-28                154                 2  2221.
##  5 12350       2011-02-02                 57                 1   294.
##  6 12351       2010-11-29                122                 1   301.
##  7 12352       2011-03-22                  9                 6   985.
##  8 12353       2010-10-27                155                 1   318.
##  9 12355       2010-05-21                314                 1   488.
## 10 12356       2011-01-18                 72                 4  5074.
## # … with 4,681 more rows, and 4 more variables: recency_score <int>,
## #   frequency_score <int>, monetary_score <int>, rfm_score <dbl>

4.1 Visualise RFM Data

As we explored earlier, the rfm package provides a number of inbuilt descriptive visualisations.

First we look at the count of customers at each order count:

customer_rfmdata %>%
  rfm_order_dist(print_plot = FALSE)

We also have a few summary plots - showing the histograms of the recency, frequency and monetary measures.

customer_rfmdata %>%
  rfm_histograms(print_plot = FALSE) +
    scale_x_continuous(labels = label_comma()) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

Finally, we look at each of the three bivariate plots to explore the relationship between the three quantities.

customer_rfmdata %>%
  rfm_rm_plot(print_plot = FALSE) +
    scale_x_log10(labels = label_comma()) +
    scale_y_log10(labels = label_comma())

customer_rfmdata %>%
  rfm_rf_plot(print_plot = FALSE) +
    scale_x_log10(labels = label_comma()) +
    scale_y_log10(labels = label_comma())

customer_rfmdata %>%
  rfm_fm_plot(print_plot = FALSE) +
    scale_x_log10(labels = label_comma()) +
    scale_y_log10(labels = label_comma())
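The package also offers a heatmap of mean monetary value across the frequency and recency score grid, which may be worth a look here (rfm_heatmap() is part of the rfm package):

```r
# Mean monetary value by recency/frequency score combination
customer_rfmdata %>%
  rfm_heatmap(print_plot = FALSE)
```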

4.2 Assign Customer Segments

We now assign each customer to a segment, which allows us to analyse each segment separately.

customer_segments_tbl <- customer_rfmdata %>%
  rfm_segment(
    segment_names   = segment_names,
    recency_lower   = recency_lower,
    recency_upper   = recency_upper,
    frequency_lower = frequency_lower,
    frequency_upper = frequency_upper,
    monetary_lower  = monetary_lower,
    monetary_upper  = monetary_upper
    )

customer_segments_tbl %>% glimpse()
## Rows: 4,691
## Columns: 9
## $ customer_id       <chr> "12346", "12347", "12348", "12349", "12350", "12351"…
## $ segment           <chr> "Loyal Customers", "Loyal Customers", "Loyal Custome…
## $ rfm_score         <dbl> 435, 435, 434, 224, 412, 312, 543, 212, 112, 445, 31…
## $ transaction_count <dbl> 3, 3, 3, 2, 1, 1, 6, 1, 1, 4, 1, 3, 8, 3, 4, 1, 1, 1…
## $ recency_days      <dbl> 72, 64, 65, 154, 57, 122, 9, 155, 314, 72, 135, 122,…
## $ amount            <dbl> 77352.96, 2510.50, 1061.40, 2221.14, 294.40, 300.93,…
## $ recency_score     <int> 4, 4, 4, 2, 4, 3, 5, 2, 1, 4, 3, 3, 4, 3, 4, 4, 4, 1…
## $ frequency_score   <int> 3, 3, 3, 2, 1, 1, 4, 1, 1, 4, 1, 3, 5, 3, 4, 1, 1, 1…
## $ monetary_score    <int> 5, 5, 4, 4, 2, 2, 3, 2, 2, 5, 5, 5, 5, 4, 2, 2, 2, 2…

We want to plot the count of each of the customer segments, before we calculate the various summary statistics.

customer_segment_count_tbl <- customer_segments_tbl %>%
  count(segment, name = "count", sort = TRUE)

ggplot(customer_segment_count_tbl) +
  geom_col(aes(x = segment, y = count, fill = segment)) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  labs(
    x = "Segment",
    y = "Count"
    ) +
  theme(
    axis.text.x = element_text(angle = 20, vjust = 0.5),
    legend.position = "none"
    )

Again, rfm provides a number of inbuilt plots of the segments, but as these are very simple summary plots we instead create our own, allowing us to summarise the segments however we wish.

plot_tbl <- customer_segments_tbl %>%
  select(
    customer_id, segment,
    Transactions  = transaction_count,
    Recency       = recency_days,
    `Total Spend` = amount
    ) %>%
  pivot_longer(
    !c(customer_id, segment),
    names_to = "quantity",
    values_to = "value"
    ) %>%
  mutate(
    value = pmax(0.1, value)
    )

ggplot(plot_tbl) +
  geom_boxplot(aes(x = segment, y = value, fill = segment)) +
  expand_limits(y = 0.1) +
  facet_wrap(vars(quantity), ncol = 2, scales = "free_y") +
  scale_y_log10(labels = label_comma()) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  labs(
    x = "Customer Segment",
    y = "Value"
    ) +
  theme(
    axis.text.x = element_text(angle = 20, vjust = 0.5, size = 8),
    legend.position = "none"
    )

4.3 Inspect Segment Validation

Now that we have assigned customers to segments, we use these segments to assess the transactions made after the cutoff date training_data_date.

segments_alloc_tbl <- customer_segments_tbl %>%
  select(customer_id, segment)

daily_spend_tbl <- tnx_purchase_tbl %>%
  filter(invoice_date > training_data_date) %>%
  group_by(invoice_date, customer_id) %>%
  summarise(
    .groups = "drop",
    
    daily_spend = sum(stock_value)
    ) %>%
  left_join(segments_alloc_tbl, by = "customer_id") %>%
  replace_na(list(segment = "New Customer"))

daily_spend_tbl %>% glimpse()
## Rows: 12,376
## Columns: 4
## $ invoice_date <date> 2011-04-01, 2011-04-01, 2011-04-01, 2011-04-01, 2011-04-…
## $ customer_id  <chr> "12523", "12569", "12731", "12748", "12877", "12949", "13…
## $ daily_spend  <dbl> 138.75, 222.68, 252.00, 25.50, 87.10, 1534.30, 1317.80, 6…
## $ segment      <chr> "Champions", "New Customer", "Champions", "Champions", "C…
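
The `left_join()` plus `replace_na()` pattern above is what tags post-cutoff customers with no training-period segment as "New Customer". The same idea in base R, on toy data with hypothetical customer IDs:

```r
# Toy data: customer "C" has no segment from the training period.
daily_spend <- data.frame(
  customer_id = c("A", "B", "C"),
  daily_spend = c(138.75, 222.68, 50.00)
)
segments_alloc <- data.frame(
  customer_id = c("A", "B"),
  segment     = c("Champions", "Loyal Customers")
)

# Left join, then fill the missing segment value.
merged <- merge(daily_spend, segments_alloc, by = "customer_id", all.x = TRUE)
merged$segment[is.na(merged$segment)] <- "New Customer"
merged
```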

Having constructed this data, we now calculate some per-customer summary statistics.

postdate_customer_stats_tbl <- daily_spend_tbl %>%
  group_by(customer_id, segment) %>%
  summarise(
    .groups = "drop",
    
    total_transactions = n(),
    total_spend        = sum(daily_spend)
    )

postdate_customer_stats_tbl %>% glimpse()
## Rows: 3,843
## Columns: 4
## $ customer_id        <chr> "12347", "12348", "12349", "12352", "12353", "12354…
## $ segment            <chr> "Loyal Customers", "Loyal Customers", "At Risk", "L…
## $ total_transactions <int> 5, 2, 1, 3, 1, 1, 1, 2, 1, 2, 2, 3, 9, 2, 4, 1, 1, …
## $ total_spend        <dbl> 3122.82, 597.00, 1457.55, 744.23, 89.00, 1079.40, 4…

First we compare the segment counts from the pre- and post-cutoff data. The first metric to check is the proportion of each segment in the dataset, though we exclude newly arrived customers from the post-date dataset to enable a direct comparison.

pre_data_tbl <- customer_segment_count_tbl %>%
  mutate(prop = count / sum(count))

post_data_tbl <- postdate_customer_stats_tbl %>%
  filter(segment != "New Customer") %>%
  count(segment, name = "count", sort = TRUE) %>%
  mutate(prop = count / sum(count))

comparison_tbl <- list(
    Training   = pre_data_tbl,
    Validation = post_data_tbl
    ) %>%
  bind_rows(.id = "data")

ggplot(comparison_tbl) +
  geom_col(aes(x = segment, y = prop, fill = data), position = "dodge") +
  labs(
    x = "Segment",
    y = "Segment Proportion",
    fill = "Dataset"
    ) +
  scale_fill_brewer(type = "qual", palette = "Dark2") +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.5))

Next, we check our RFM statistics according to the validation data, as of the final date in the dataset.

First we construct the data, then produce boxplots for each segment.

validation_date <- daily_spend_tbl %>% pull(invoice_date) %>% max()

validation_rfm_data_tbl <- daily_spend_tbl %>%
  group_by(segment, customer_id) %>%
  summarise(
    .groups = "drop",
    
    transaction_count  = n(),
    recency_days       = (validation_date - max(invoice_date)) %>% as.numeric(),
    amount             = sum(daily_spend)
    )

validation_rfm_data_tbl %>% glimpse()
## Rows: 3,843
## Columns: 5
## $ segment           <chr> "About To Sleep", "About To Sleep", "About To Sleep"…
## $ customer_id       <chr> "12353", "12445", "12457", "12475", "12519", "12787"…
## $ transaction_count <int> 1, 1, 6, 1, 1, 3, 1, 1, 2, 5, 2, 1, 1, 1, 3, 1, 1, 3…
## $ recency_days      <dbl> 204, 22, 66, 53, 63, 9, 192, 96, 84, 4, 4, 196, 64, …
## $ amount            <dbl> 89.00, 77.40, 2043.23, 669.38, 286.84, 418.66, 294.9…
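
Note that in the summary above, subtracting two `Date` objects yields a `difftime` object rather than a plain number, hence the `as.numeric()` coercion. With illustrative dates:

```r
# Subtracting Dates gives a difftime; coerce to a plain numeric day count.
validation_date <- as.Date("2011-12-09")
last_purchase   <- as.Date("2011-11-25")

validation_date - last_purchase
#> Time difference of 14 days

recency_days <- as.numeric(validation_date - last_purchase)
recency_days
#> 14
```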

Having constructed the table, we view the data as an interactive JS datatable.

validation_rfm_data_tbl %>% datatable()

We now produce boxplots of the three metrics using these segments, and also look at the new customers as a separate category.

plot_tbl <- validation_rfm_data_tbl %>%
  select(
    customer_id, segment,
    Transactions  = transaction_count,
    Recency       = recency_days,
    `Total Spend` = amount
    ) %>%
  pivot_longer(
    !c(customer_id, segment),
    names_to  = "quantity",
    values_to = "value"
    ) %>%
  mutate(
    value = pmax(0.1, value)
    )

ggplot(plot_tbl) +
  geom_boxplot(aes(x = segment, y = value, fill = segment)) +
  expand_limits(y = 0.1) +
  facet_wrap(vars(quantity), ncol = 2, scales = "free_y") +
  scale_y_log10(labels = label_comma()) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  labs(
    x = "Customer Segment",
    y = "Value"
    ) +
  theme(
    axis.text.x = element_text(angle = 20, vjust = 0.5, size = 8),
    legend.position = "none"
    )

As an alternative, we also look at violin plots rather than boxplots.

ggplot(plot_tbl) +
  geom_violin(aes(x = segment, y = value, fill = segment)) +
  expand_limits(y = 0.1) +
  facet_wrap(vars(quantity), ncol = 2, scales = "free_y") +
  scale_y_log10(labels = label_comma()) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  labs(
    x = "Customer Segment",
    y = "Value"
    ) +
  theme(
    axis.text.x = element_text(angle = 20, vjust = 0.5, size = 8),
    legend.position = "none"
    )

We certainly see that our “Champions” are the ‘best’ segment in terms of the three metrics we measure. That said, our new customers also appear to have desirable metrics on aggregate.

4.4 Simple Product Description Analysis

With each product classified into a group, we now inspect the product descriptions within those groups.

We split the product descriptions by product_group and construct a word cloud for each one. Word clouds are a simple starting point to help visualise common words in a group.

create_wordcloud <- function(tokens_tbl, seed = 42) {
  cloud_plot <- tokens_tbl %>%
    count(word, name = "freq", sort = TRUE) %>%
    slice_max(order_by = freq, n = 100) %>%
    ggwordcloud2(seed = seed)
  
  return(cloud_plot)
}

product_group_tokens_tbl <- product_group_tnxgroups_tbl %>%
  group_by(product_group) %>%
  mutate(product_count = n()) %>%
  mutate(label = glue("{product_group} ({product_count})")) %>%
  ungroup() %>%
  unnest_tokens(word, desc)

w_tbl <- product_group_tokens_tbl %>%
  filter(product_group != "TNX_010") %>%
  group_nest(label, product_count, .key = "token_data") %>%
  mutate(
    cloud = map2(token_data, product_count, create_wordcloud)
    )

plot_grid(
    plotlist = w_tbl$cloud,
    labels   = w_tbl$label,
    ncol     = 3
    )

We now repeat this exercise using a slightly lower-level approach to constructing the word clouds.

wc_input_tbl <- product_group_tokens_tbl %>%
  count(label, word, name = "freq", sort = TRUE) %>%
  group_by(label) %>%
  slice_max(order_by = freq, n = 50) %>%
  ungroup()


ggplot(wc_input_tbl %>% filter(label != "TNX_010 (6)")) +
  geom_text_wordcloud_area(
    aes(label = word, size = freq),
    shape = "square", seed = 42, rm_outside = TRUE) +
  facet_wrap(vars(label), scales = "free", ncol = 3) +
  scale_size_area(max_size = 15) +
  theme_minimal()

5 Analyse Categorical Co-occurrences

We now need to perform a correspondence analysis (CA) on the co-occurrence of customer groupings and product groupings to see what types of products are associated with the various customer groups.

5.1 Construct Customer and Product Allocations

We need to construct a lookup table for all customers in our dataset. Those customers who first appear in our ‘validation’ data are assigned the grouping “New Customer”.

customer_segment_allocation_tbl <- tnx_purchase_tbl %>%
  drop_na(customer_id) %>%
  select(customer_id) %>%
  distinct() %>%
  left_join(segments_alloc_tbl, by = "customer_id") %>%
  replace_na(list(segment = "New Customer")) %>%
  arrange(customer_id)

customer_segment_allocation_tbl %>% glimpse()
## Rows: 5,852
## Columns: 2
## $ customer_id <chr> "12346", "12347", "12348", "12349", "12350", "12351", "123…
## $ segment     <chr> "Loyal Customers", "Loyal Customers", "Loyal Customers", "…

We also construct a lookup table allocating each stock_code to a product_group_id.

product_group_allocation_tbl <- product_data_full_tbl %>%
  select(stock_code, product_group_id) %>%
  arrange(stock_code)

product_group_allocation_tbl %>% glimpse()
## Rows: 4,621
## Columns: 2
## $ stock_code       <chr> "10002", "10080", "10109", "10120", "10123C", "10123G…
## $ product_group_id <chr> "AR_DISJOINT_053", "none", "none", "none", "none", "n…

5.2 Construct Co-occurrence Analysis

We now wish to add the various product and customer groupings to our transaction data, constructing a table we can later use for correspondence analysis. Before turning to library routines to perform this analysis, we first look at some of the co-occurrence frequencies.

tnx_correspondence_tbl <- tnx_purchase_tbl %>%
  select(
    invoice_id, invoice_date, stock_code, customer_id
    ) %>%
  mutate(
    tnx_ym    = invoice_date %>% format("%Y%m"),
    tnx_dow   = invoice_date %>% format("%A"),
    tnx_month = invoice_date %>% format("%B"),
    tnx_qtr   = invoice_date %>% as.yearqtr() %>% as.character()
    ) %>%
  inner_join(customer_cohort_tbl, by = "customer_id") %>%
  inner_join(customer_segment_allocation_tbl, by = "customer_id") %>%
  inner_join(product_group_tnxgroups_tbl, by = "stock_code")

tnx_correspondence_tbl %>% glimpse()
## Rows: 802,651
## Columns: 15
## $ invoice_id      <chr> "489434", "489434", "489434", "489434", "489434", "489…
## $ invoice_date    <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-…
## $ stock_code      <chr> "85048", "79323P", "79323W", "22041", "21232", "22064"…
## $ customer_id     <chr> "13085", "13085", "13085", "13085", "13085", "13085", …
## $ tnx_ym          <chr> "200912", "200912", "200912", "200912", "200912", "200…
## $ tnx_dow         <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",…
## $ tnx_month       <chr> "December", "December", "December", "December", "Decem…
## $ tnx_qtr         <chr> "2009 Q4", "2009 Q4", "2009 Q4", "2009 Q4", "2009 Q4",…
## $ cohort_qtr      <chr> "2009 Q4", "2009 Q4", "2009 Q4", "2009 Q4", "2009 Q4",…
## $ cohort_ym       <chr> "2009 12", "2009 12", "2009 12", "2009 12", "2009 12",…
## $ first_tnx_date  <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-…
## $ total_tnx_count <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 55, 55, 55, 55, 55…
## $ segment         <chr> "Champions", "Champions", "Champions", "Champions", "C…
## $ product_group   <chr> "TNX_005", "TNX_001", "TNX_001", "TNX_008", "TNX_003",…
## $ desc            <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LI…

We now want to do some basic analysis based on the relative frequencies of the values of both the product groups and the customer segments.

segment_group_contingency_tbl <- tnx_correspondence_tbl %>%
  filter(
    invoice_date <= training_data_date
    ) %>%
  count(segment, product_group, name = "group_count") %>%
  mutate(
    obs_prop = group_count / sum(group_count)
    )

segment_group_contingency_tbl %>% glimpse()
## Rows: 75
## Columns: 4
## $ segment       <chr> "About To Sleep", "About To Sleep", "About To Sleep", "A…
## $ product_group <chr> "TNX_001", "TNX_002", "TNX_003", "TNX_004", "TNX_005", "…
## $ group_count   <int> 973, 270, 1045, 1778, 1433, 265, 525, 129, 161, 5232, 45…
## $ obs_prop      <dbl> 1.944406e-03, 5.395576e-04, 2.088288e-03, 3.553086e-03, …

segment_props_tbl <- segment_group_contingency_tbl %>%
  count(segment, wt = group_count, name = "segment_count") %>%
  mutate(
    segment_prop = segment_count / sum(segment_count)
    )

prodgroup_props_tbl <- segment_group_contingency_tbl %>%
  count(product_group, wt = group_count, name = "prodgroup_count") %>%
  mutate(
    prodgroup_prop = prodgroup_count / sum(prodgroup_count)
    )

segment_group_independ_tbl <- expand_grid(
    segment_props_tbl,
    prodgroup_props_tbl
    ) %>%
  transmute(
    segment, product_group, segment_count, prodgroup_count,
    independ_prop = segment_prop * prodgroup_prop
    )

segment_group_independ_tbl %>% glimpse()
## Rows: 80
## Columns: 5
## $ segment         <chr> "About To Sleep", "About To Sleep", "About To Sleep", …
## $ product_group   <chr> "TNX_001", "TNX_002", "TNX_003", "TNX_004", "TNX_005",…
## $ segment_count   <int> 6579, 6579, 6579, 6579, 6579, 6579, 6579, 6579, 6579, …
## $ prodgroup_count <int> 106681, 27351, 94400, 94937, 64105, 31001, 47590, 1560…
## $ independ_prop   <dbl> 2.802819e-03, 7.185899e-04, 2.480161e-03, 2.494270e-03…
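
The `expand_grid()` construction above amounts to an outer product of the two marginal distributions: under independence, the expected proportion for each (segment, product group) cell is the product of its row and column proportions. A minimal base-R check of that identity on a toy contingency table:

```r
# Toy 2x3 contingency table of counts.
counts <- matrix(c(30, 10, 20,
                   15,  5, 20),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("seg_A", "seg_B"),
                                 c("grp_1", "grp_2", "grp_3")))

obs_prop <- counts / sum(counts)
row_prop <- rowSums(obs_prop)   # segment margins
col_prop <- colSums(obs_prop)   # product-group margins

# Expected cell proportions under independence: outer product of the margins.
independ_prop <- outer(row_prop, col_prop)

# Both grids sum to 1; the differences flag over/under-represented cells.
obs_prop - independ_prop
```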

We now compare the observed proportions of these combinations against the “theoretical” proportions under the assumption of independence.

segment_group_combined_tbl <- segment_group_independ_tbl %>%
  left_join(segment_group_contingency_tbl, by = c("segment", "product_group")) %>%
  replace_na(list(group_count = 0, obs_prop = 0)) %>%
  mutate(
    ignore = (group_count == 0) | (independ_prop <= 1e-4),
    diff = (obs_prop - independ_prop),
    perc = if_else(ignore == TRUE, 0, (obs_prop / independ_prop) - 1)
    )

segment_group_combined_tbl %>% glimpse()
## Rows: 80
## Columns: 10
## $ segment         <chr> "About To Sleep", "About To Sleep", "About To Sleep", …
## $ product_group   <chr> "TNX_001", "TNX_002", "TNX_003", "TNX_004", "TNX_005",…
## $ segment_count   <int> 6579, 6579, 6579, 6579, 6579, 6579, 6579, 6579, 6579, …
## $ prodgroup_count <int> 106681, 27351, 94400, 94937, 64105, 31001, 47590, 1560…
## $ independ_prop   <dbl> 2.802819e-03, 7.185899e-04, 2.480161e-03, 2.494270e-03…
## $ group_count     <dbl> 973, 270, 1045, 1778, 1433, 265, 525, 129, 161, 0, 523…
## $ obs_prop        <dbl> 1.944406e-03, 5.395576e-04, 2.088288e-03, 3.553086e-03…
## $ ignore          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ diff            <dbl> -8.584131e-04, -1.790324e-04, -3.918737e-04, 1.058817e…
## $ perc            <dbl> -0.306267796, -0.249144013, -0.158003300, 0.424499648,…

Now that we have combined this data, we construct a heatmap of the proportions.

ggplot(segment_group_combined_tbl) +
  geom_tile(aes(x = segment, y = product_group, fill = diff)) +
  geom_text(aes(x = segment, y = product_group,
                label = label_comma(accuracy = 1)(group_count))) +
  labs(
    x = "Customer Segment",
    y = "Product Group",
    fill = "Diff"
    ) +
  scale_fill_gradient2(low = "blue", high = "red", midpoint = 0) +
  ggtitle("Heatmap of Differences Between Observed and Theoretical Proportions") +
  theme(axis.text.x = element_text(angle = 10, vjust = 0.5))

We now make a similar plot, but showing the relative percentage difference between the theoretical proportion under the assumption of independence and what was observed.

ggplot(segment_group_combined_tbl) +
  geom_tile(aes(x = segment, y = product_group, fill = perc)) +
  geom_text(aes(x = segment, y = product_group,
                label = label_comma(accuracy = 1)(group_count))) +
  labs(
    x = "Customer Segment",
    y = "Product Group",
    fill = "Perc"
    ) +
  scale_fill_gradient2(low = "blue", high = "red", midpoint = 0) +
  ggtitle("Heatmap of Percentage Differences Between Observed and Theoretical Proportions") +
  theme(axis.text.x = element_text(angle = 10, vjust = 0.5))

5.3 Construct Balloon Plot

Another way to visualise this co-occurrence of values is a balloon plot, which shows the counts of each combination of the categorical values within our dataset.

plot_tbl <- tnx_correspondence_tbl %>%
  count(segment, product_group, name = "freq_count")

ggballoonplot(
    plot_tbl,
    show.label = TRUE,
    ggtheme = theme_cowplot()
    ) +
  labs(
    x = "Customer Segment",
    y = "Product Group",
    size = "Count"
    ) +
  theme(axis.text.x = element_text(angle = 20))

5.4 Perform Basic Correspondence Analysis

To run the various routines to calculate the correspondence analysis we need to convert our data into a matrix format required by the libraries.

To do this efficiently, we construct a quick function that converts a data table into a matrix, with the first column becoming the row names of the matrix.

make_df_matrix <- function(data_tbl) {
  seg_names <- data_tbl %>% pull(1)
  
  data_mat <- data_tbl %>%
    select(-1) %>%
    as.matrix() %>%
    set_rownames(seg_names)
  
  return(data_mat)
}
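
`set_rownames()` is magrittr's pipe-friendly alias for the base `rownames<-` replacement function, so the same conversion can be written without magrittr. A base-R sketch (the function name `make_df_matrix_base` is ours, and it assumes a plain data frame input):

```r
# Base-R equivalent of make_df_matrix(): first column becomes the row names.
make_df_matrix_base <- function(data_df) {
  data_mat <- as.matrix(data_df[, -1, drop = FALSE])
  rownames(data_mat) <- data_df[[1]]
  data_mat
}

toy_df <- data.frame(segment = c("Champions", "Lost"),
                     TNX_001 = c(100L, 5L),
                     TNX_002 = c(40L, 2L))
make_df_matrix_base(toy_df)
```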

segment_group_freq_tbl <- tnx_correspondence_tbl %>%
  count(segment, product_group, name = "freq_count") %>%
  pivot_wider(
    id_cols     = segment,
    names_from  = product_group,
    values_from = freq_count,
    values_fill = 0
    )

segment_group_mat <- segment_group_freq_tbl %>%
  make_df_matrix()

segment_group_mat %>% glimpse()
##  int [1:9, 1:10] 1325 5637 77313 3448 29463 2394 6587 3343 8660 1271 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:9] "About To Sleep" "At Risk" "Champions" "Lost" ...
##   ..$ : chr [1:10] "TNX_001" "TNX_002" "TNX_003" "TNX_004" ...

Now that we have the frequency matrix we can run the CA routine and visualise the outputs.

segment_group_ca <- segment_group_mat %>% CA(graph = FALSE)

First we visualise the variance determined from the eigenvalues of the matrix. This gives us an indication of the ‘true’ underlying dimensionality of the data.

segment_group_ca %>% fviz_eig()

From this plot, we see that we can capture almost 94% of the variance with just two dimensions.
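
The eigenvalues behind this scree plot come from the singular value decomposition of the matrix of standardised residuals. A minimal base-R sketch of that computation (FactoMineR's `CA()` does this, plus much more, internally; the helper name `ca_variance` is ours):

```r
# Correspondence analysis by hand: SVD of the standardised residuals.
ca_variance <- function(counts) {
  P        <- counts / sum(counts)      # correspondence matrix
  row_mass <- rowSums(P)
  col_mass <- colSums(P)
  S <- diag(1 / sqrt(row_mass)) %*%
       (P - row_mass %o% col_mass) %*%
       diag(1 / sqrt(col_mass))
  eig <- svd(S)$d^2                     # principal inertias (eigenvalues)
  eig / sum(eig)                        # proportion of variance per dimension
}

toy_counts <- matrix(c(20,  5, 10,
                        3, 25,  8,
                       12,  7, 30), nrow = 3, byrow = TRUE)
round(ca_variance(toy_counts), 3)
```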

We then construct the biplot, which is a way of visualising the co-occurrences of each of the categorical values.

segment_group_ca %>%
  fviz_ca_biplot(
    repel = TRUE,
    title = "CA Biplot of Customer Segment Against Product Group"
    )

According to the biplot, there is a suggested relationship between customers in the “Champions” segment and products in group “TNX_009” - we ignore “TNX_010” as it is very small.

As before, let us look at a word cloud of the types of items in that group.

wc_009_tbl <- product_group_tokens_tbl %>%
  filter(product_group == "TNX_009") %>%
  count(word, name = "freq") %>%
  slice_max(order_by = freq, n = 100)

wc_plot <- ggwordcloud2(
    data    = wc_009_tbl,
    shuffle = FALSE,
    size    = 4,
    seed    = 42421
    )

wc_plot %>% plot()

We also see that customers designated “Potential Loyalist” are associated with “TNX_005”, so we show a word cloud for this group as well.

wc_005_tbl <- product_group_tokens_tbl %>%
  filter(product_group == "TNX_005") %>%
  count(word, name = "freq") %>%
  slice_max(order_by = freq, n = 100)

wc_plot <- ggwordcloud2(
    data    = wc_005_tbl,
    shuffle = FALSE,
    size    = 4,
    seed    = 42422
    )

wc_plot %>% plot()

6 Write Data to Disk

We now write these datasets to disk for later use.

product_group_tnxgroups_tbl %>% write_rds("data/product_group_tnxgroups_tbl.rds")

customer_rfmdata      %>% write_rds("data/customer_rfmdata.rds")
customer_segments_tbl %>% write_rds("data/customer_segments_tbl.rds")

validation_rfm_data_tbl %>% write_rds("data/validation_rfm_data_tbl.rds")

7 R Environment

sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.1.0 (2021-05-18)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2021-11-18                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package       * version date       lib source        
##  abind           1.4-5   2016-07-21 [1] RSPM (R 4.1.0)
##  arules        * 1.6-8   2021-05-17 [1] RSPM (R 4.1.0)
##  arulesViz     * 1.5-0   2021-05-21 [1] RSPM (R 4.1.0)
##  assertthat      0.2.1   2019-03-21 [1] RSPM (R 4.1.0)
##  backports       1.2.1   2020-12-09 [1] RSPM (R 4.1.0)
##  bit             4.0.4   2020-08-04 [1] RSPM (R 4.1.0)
##  bit64           4.0.5   2020-08-30 [1] RSPM (R 4.1.0)
##  bookdown        0.22    2021-04-22 [1] RSPM (R 4.1.0)
##  broom           0.7.9   2021-07-27 [1] RSPM (R 4.1.0)
##  bslib           0.2.5.1 2021-05-18 [1] RSPM (R 4.1.0)
##  cachem          1.0.5   2021-05-15 [1] RSPM (R 4.1.0)
##  car             3.0-11  2021-06-27 [1] RSPM (R 4.1.0)
##  carData         3.0-4   2020-05-22 [1] RSPM (R 4.1.0)
##  cellranger      1.1.0   2016-07-27 [1] RSPM (R 4.1.0)
##  cli             3.0.1   2021-07-17 [1] RSPM (R 4.1.0)
##  cluster         2.1.2   2021-04-17 [2] CRAN (R 4.1.0)
##  codetools       0.2-18  2020-11-04 [2] CRAN (R 4.1.0)
##  colorspace      2.0-2   2021-06-24 [1] RSPM (R 4.1.0)
##  conflicted    * 1.0.4   2019-06-21 [1] RSPM (R 4.1.0)
##  cowplot       * 1.1.1   2020-12-30 [1] RSPM (R 4.1.0)
##  crayon          1.4.1   2021-02-08 [1] RSPM (R 4.1.0)
##  crosstalk       1.1.1   2021-01-12 [1] RSPM (R 4.1.0)
##  curl            4.3.2   2021-06-23 [1] RSPM (R 4.1.0)
##  data.table      1.14.0  2021-02-21 [1] RSPM (R 4.1.0)
##  DBI             1.1.1   2021-01-15 [1] RSPM (R 4.1.0)
##  dbplyr          2.1.1   2021-04-06 [1] RSPM (R 4.1.0)
##  digest          0.6.27  2020-10-24 [1] RSPM (R 4.1.0)
##  dplyr         * 1.0.7   2021-06-18 [1] RSPM (R 4.1.0)
##  DT            * 0.18    2021-04-14 [1] RSPM (R 4.1.0)
##  ellipsis        0.3.2   2021-04-29 [1] RSPM (R 4.1.0)
##  evaluate        0.14    2019-05-28 [1] RSPM (R 4.1.0)
##  factoextra    * 1.0.7   2020-04-01 [1] RSPM (R 4.1.0)
##  FactoMineR    * 2.4     2020-12-11 [1] RSPM (R 4.1.0)
##  fansi           0.5.0   2021-05-25 [1] RSPM (R 4.1.0)
##  farver          2.1.0   2021-02-28 [1] RSPM (R 4.1.0)
##  fastmap         1.1.0   2021-01-25 [1] RSPM (R 4.1.0)
##  flashClust      1.01-2  2012-08-21 [1] RSPM (R 4.1.0)
##  forcats       * 0.5.1   2021-01-27 [1] RSPM (R 4.1.0)
##  foreign         0.8-81  2020-12-22 [2] CRAN (R 4.1.0)
##  fs              1.5.0   2020-07-31 [1] RSPM (R 4.1.0)
##  furrr         * 0.2.3   2021-06-25 [1] RSPM (R 4.1.0)
##  future        * 1.21.0  2020-12-10 [1] RSPM (R 4.1.0)
##  generics        0.1.0   2020-10-31 [1] RSPM (R 4.1.0)
##  ggplot2       * 3.3.5   2021-06-25 [1] RSPM (R 4.1.0)
##  ggpubr        * 0.4.0   2020-06-27 [1] RSPM (R 4.1.0)
##  ggrepel         0.9.1   2021-01-15 [1] RSPM (R 4.1.0)
##  ggsignif        0.6.2   2021-06-14 [1] RSPM (R 4.1.0)
##  ggwordcloud   * 0.5.0   2019-06-02 [1] RSPM (R 4.1.0)
##  globals         0.14.0  2020-11-22 [1] RSPM (R 4.1.0)
##  glue          * 1.4.2   2020-08-27 [1] RSPM (R 4.1.0)
##  gtable          0.3.0   2019-03-25 [1] RSPM (R 4.1.0)
##  haven           2.4.3   2021-08-04 [1] RSPM (R 4.1.0)
##  highr           0.9     2021-04-16 [1] RSPM (R 4.1.0)
##  hms             1.1.0   2021-05-17 [1] RSPM (R 4.1.0)
##  htmltools       0.5.1.1 2021-01-22 [1] RSPM (R 4.1.0)
##  htmlwidgets     1.5.3   2020-12-10 [1] RSPM (R 4.1.0)
##  httr            1.4.2   2020-07-20 [1] RSPM (R 4.1.0)
##  igraph          1.2.6   2020-10-06 [1] RSPM (R 4.1.0)
##  janeaustenr     0.1.5   2017-06-10 [1] RSPM (R 4.1.0)
##  jquerylib       0.1.4   2021-04-26 [1] RSPM (R 4.1.0)
##  jsonlite        1.7.2   2020-12-09 [1] RSPM (R 4.1.0)
##  knitr           1.33    2021-04-24 [1] RSPM (R 4.1.0)
##  labeling        0.4.2   2020-10-20 [1] RSPM (R 4.1.0)
##  lattice         0.20-44 2021-05-02 [2] CRAN (R 4.1.0)
##  leaps           3.1     2020-01-16 [1] RSPM (R 4.1.0)
##  lifecycle       1.0.0   2021-02-15 [1] RSPM (R 4.1.0)
##  listenv         0.8.0   2019-12-05 [1] RSPM (R 4.1.0)
##  lubridate       1.7.10  2021-02-26 [1] RSPM (R 4.1.0)
##  magrittr      * 2.0.1   2020-11-17 [1] RSPM (R 4.1.0)
##  MASS            7.3-54  2021-05-03 [2] CRAN (R 4.1.0)
##  Matrix        * 1.3-3   2021-05-04 [2] CRAN (R 4.1.0)
##  modelr          0.1.8   2020-05-19 [1] RSPM (R 4.1.0)
##  munsell         0.5.0   2018-06-12 [1] RSPM (R 4.1.0)
##  openxlsx        4.2.4   2021-06-16 [1] RSPM (R 4.1.0)
##  parallelly      1.27.0  2021-07-19 [1] RSPM (R 4.1.0)
##  pillar          1.6.2   2021-07-29 [1] RSPM (R 4.1.0)
##  pkgconfig       2.0.3   2019-09-22 [1] RSPM (R 4.1.0)
##  png             0.1-7   2013-12-03 [1] RSPM (R 4.1.0)
##  purrr         * 0.3.4   2020-04-17 [1] RSPM (R 4.1.0)
##  R6              2.5.0   2020-10-28 [1] RSPM (R 4.1.0)
##  RColorBrewer    1.1-2   2014-12-07 [1] RSPM (R 4.1.0)
##  Rcpp            1.0.7   2021-07-07 [1] RSPM (R 4.1.0)
##  readr         * 2.0.0   2021-07-20 [1] RSPM (R 4.1.0)
##  readxl          1.3.1   2019-03-13 [1] RSPM (R 4.1.0)
##  reprex          2.0.1   2021-08-05 [1] RSPM (R 4.1.0)
##  rfm           * 0.2.2   2020-07-21 [1] RSPM (R 4.1.0)
##  rio             0.5.27  2021-06-21 [1] RSPM (R 4.1.0)
##  rlang         * 0.4.11  2021-04-30 [1] RSPM (R 4.1.0)
##  rmarkdown       2.10    2021-08-06 [1] RSPM (R 4.1.0)
##  rmdformats      1.0.2   2021-04-19 [1] RSPM (R 4.1.0)
##  rstatix         0.7.0   2021-02-13 [1] RSPM (R 4.1.0)
##  rstudioapi      0.13    2020-11-12 [1] RSPM (R 4.1.0)
##  rvest           1.0.1   2021-07-26 [1] RSPM (R 4.1.0)
##  sass            0.4.0   2021-05-12 [1] RSPM (R 4.1.0)
##  scales        * 1.1.1   2020-05-11 [1] RSPM (R 4.1.0)
##  scatterplot3d   0.3-41  2018-03-14 [1] RSPM (R 4.1.0)
##  sessioninfo     1.1.1   2018-11-05 [1] RSPM (R 4.1.0)
##  SnowballC       0.7.0   2020-04-01 [1] RSPM (R 4.1.0)
##  stringi         1.7.3   2021-07-16 [1] RSPM (R 4.1.0)
##  stringr       * 1.4.0   2019-02-10 [1] RSPM (R 4.1.0)
##  tibble        * 3.1.3   2021-07-23 [1] RSPM (R 4.1.0)
##  tidygraph     * 1.2.0   2020-05-12 [1] RSPM (R 4.1.0)
##  tidyr         * 1.1.3   2021-03-03 [1] RSPM (R 4.1.0)
##  tidyselect      1.1.1   2021-04-30 [1] RSPM (R 4.1.0)
##  tidytext      * 0.3.1   2021-04-10 [1] RSPM (R 4.1.0)
##  tidyverse     * 1.3.1   2021-04-15 [1] RSPM (R 4.1.0)
##  tokenizers      0.2.1   2018-03-29 [1] RSPM (R 4.1.0)
##  tzdb            0.1.2   2021-07-20 [1] RSPM (R 4.1.0)
##  utf8            1.2.2   2021-07-24 [1] RSPM (R 4.1.0)
##  vctrs           0.3.8   2021-04-29 [1] RSPM (R 4.1.0)
##  visNetwork      2.0.9   2019-12-06 [1] RSPM (R 4.1.0)
##  vroom           1.5.4   2021-08-05 [1] RSPM (R 4.1.0)
##  withr           2.4.2   2021-04-18 [1] RSPM (R 4.1.0)
##  wordcloud2    * 0.2.1   2018-01-03 [1] RSPM (R 4.1.0)
##  xfun            0.25    2021-08-06 [1] RSPM (R 4.1.0)
##  xml2            1.3.2   2020-04-23 [1] RSPM (R 4.1.0)
##  yaml            2.2.1   2020-02-01 [1] RSPM (R 4.1.0)
##  zip             2.2.0   2021-05-31 [1] RSPM (R 4.1.0)
##  zoo           * 1.8-9   2021-03-09 [1] RSPM (R 4.1.0)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library